13 research outputs found

    Space-efficient detection of unusual words

    Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2 \log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale. Comment: arXiv admin note: text overlap with arXiv:1502.0637

    Minimal Forbidden Factors of Circular Words

    Minimal forbidden factors are a useful tool for investigating properties of words and languages. Two factorial languages are distinct if and only if they have different (antifactorial) sets of minimal forbidden factors. There exist algorithms for computing the minimal forbidden factors of a word, as well as of a regular factorial language. Conversely, Crochemore et al. [IPL, 1998] gave an algorithm that, given the trie recognizing a finite antifactorial language $M$, computes a DFA recognizing the language whose set of minimal forbidden factors is $M$. In the same paper, they showed that the obtained DFA is minimal if the input trie recognizes the minimal forbidden factors of a single word. We generalize this result to the case of a circular word. We discuss several combinatorial properties of the minimal forbidden factors of a circular word. As a byproduct, we obtain a formal definition of the factor automaton of a circular word. Finally, we investigate the case of minimal forbidden factors of the circular Fibonacci words. Comment: To appear in Theoretical Computer Science

    A framework for space-efficient string kernels

    String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient or incur large slowdowns. We show that a number of exact string kernels, like the $k$-mer kernel, the substring kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in $O(nd)$ time and in $o(n)$ bits of space in addition to the input, using just a $\mathtt{rangeDistinct}$ data structure on the Burrows-Wheeler transform of the input strings, which takes $O(d)$ time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of $k$, like the $k$-mer profile and the $k$-th order empirical entropy, and for calibrating the value of $k$ using the data.
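As a rough illustration of what a $k$-mer kernel measures, the sketch below computes it naively as the inner product of two $k$-mer count vectors. This is a hash-table baseline, not the BWT-based method of the abstract; the function names `kmer_profile` and `kmer_kernel` are hypothetical and chosen only for this example.

```python
from collections import Counter


def kmer_profile(s: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s: str, t: str, k: int) -> int:
    """Inner product of the k-mer count vectors of s and t (naive baseline)."""
    ps, pt = kmer_profile(s, k), kmer_profile(t, k)
    # Iterate over the smaller profile; a missing key in a Counter counts as 0.
    if len(ps) > len(pt):
        ps, pt = pt, ps
    return sum(count * pt[kmer] for kmer, count in ps.items())


if __name__ == "__main__":
    print(kmer_kernel("abab", "baba", 2))
```

This baseline stores every distinct $k$-mer explicitly, which is the kind of space cost the abstract aims to avoid.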

    Minimal Absent Words in Rooted and Unrooted Trees

    We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet of cardinality. We show that the set of minimal absent words of a rooted (resp. unrooted) tree T with n nodes has cardinality (resp.), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time (resp. assuming an integer alphabet of size polynomial in n

    Linear-Time Sequence Comparison Using Minimal Absent Words & Applications

    Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realized by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as q-gram distance, are usually computed in time linear with respect to the length of the sequences. In this article, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an absent word of some sequence if it does not occur in the sequence. An absent word is minimal if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering all their minimal absent words. In the process, we present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences.
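The definition of a minimal absent word given above can be checked directly with a brute-force sketch. It is not the linear-time algorithm of the abstract: it enumerates all factors explicitly and is only practical for short strings. The function name `minimal_absent_words` and the default choice of alphabet (the letters occurring in the input) are assumptions made for this example.

```python
def minimal_absent_words(s: str, alphabet=None) -> set:
    """Brute-force minimal absent words of s over the given alphabet.

    A word a+u+b (a, b letters) is a minimal absent word if it does not
    occur in s while both a+u and u+b do occur. A single letter that does
    not occur in s is also minimal absent (its only proper factor is the
    empty word).
    """
    if alphabet is None:
        alphabet = sorted(set(s))  # assumption: alphabet = letters seen in s
    # All factors of s, including the empty word.
    factors = {s[i:j] for i in range(len(s)) for j in range(i, len(s) + 1)}
    factors.add("")
    maws = {a for a in alphabet if a not in factors}
    for u in factors:
        for a in alphabet:
            if a + u not in factors:
                continue
            for b in alphabet:
                if u + b in factors and a + u + b not in factors:
                    maws.add(a + u + b)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab")))
```

Two sequences can then be compared through their sets of minimal absent words, for example by looking at the symmetric difference of the two sets.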

    Faster Online Computation of the Succinct Longest Previous Factor Array

    We consider the problem of computing online the Longest Previous Factor array LPF[1, n] of a text T of length n. For each, LPF[i] stores the length of the longest factor of T with at least two occurrences, one ending at i and the other at a previous position. We present an improvement over the previous solution by Okanohara and Sadakane (ESA 2008): our solution uses less space (compressed instead of succinct) and runs in time, thus being faster by a logarithmic factor. As a by-product, we also obtain the first online algorithm computing the Longest Common Suffix (LCS) array (that is, the LCP array of the reversed text) in time and compressed space. We also observe that the LPF array can be represented succinctly in 2n bits. Our online algorithm computes directly the succinct LPF and LCS arrays
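To make the definition concrete, here is a naive sketch that follows the wording above literally: for each position i it finds the longest factor ending at i that also ends at some earlier position. It takes quadratic time or worse, is neither online-efficient nor compressed, and uses 0-based positions. The function name `lpf_ending_at` is hypothetical.

```python
def lpf_ending_at(t: str) -> list:
    """Naive LPF under the 'ending at i' definition, 0-based positions.

    lpf[i] is the length of the longest factor of t that ends at position i
    and also ends at some earlier position j < i (occurrences may overlap).
    """
    n = len(t)
    lpf = [0] * n
    for i in range(n):
        best = 0
        for j in range(i):
            # Extend the common suffix of t[0..i] and t[0..j] leftwards.
            length = 0
            while length <= j and t[i - length] == t[j - length]:
                length += 1
            best = max(best, length)
        lpf[i] = best
    return lpf


if __name__ == "__main__":
    print(lpf_ending_at("abab"))
```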

    Minimal Absent Words in a Sliding Window and Applications to On-Line Pattern Matching

    An absent (or forbidden) word of a word y is a word that does not occur in y. It is then called minimal if all its proper factors occur in y. There exist linear-time and linear-space algorithms for computing all minimal absent words of y (Crochemore et al. in Inf Process Lett 67:111–117, 1998; Belazzougui et al. in ESA 8125:133–144, 2013; Barton et al. in BMC Bioinform 15:388, 2014). Minimal absent words are used for data compression (Crochemore et al. in Proc IEEE 88:1756–1768, 2000; Ota and Morita in Theoret Comput Sci 526:108–119, 2014) and for alignment-free sequence comparison by utilizing a metric based on minimal absent words (Chairungsee and Crochemore in Theoret Comput Sci 450:109–116, 2012). They are also used in molecular biology; for instance, three minimal absent words of the human genome were found to play a functional role in a coding region in Ebola virus genomes (Silva et al. in Bioinformatics 31:2421–2425, 2015). In this article we introduce a new application of minimal absent words for on-line pattern matching. Specifically, we present an algorithm that, given a pattern x and a text y, computes the distance between x and every window of size |x| on y. The running time is O(σ|y|), where σ is the size of the alphabet. Along the way, we show an O(σ|y|)-time and O(σ|x|)-space algorithm to compute the minimal absent words of every window of size |x| on y, together with some new combinatorial insight on minimal absent words.